D² Log Loss Score (d2_log_loss_score)#

d2_log_loss_score is an R²-style metric for probabilistic classification.

Instead of asking “How accurate are the predicted labels?” it asks:

How much did my model reduce log loss compared to a null model that ignores features and only predicts class proportions?


Learning goals#

By the end you should be able to:

  • explain log loss as “surprise” / negative log-likelihood

  • define and interpret the D² log loss score

  • implement log_loss and d2_log_loss_score from scratch in NumPy

  • see (with plots) why overconfident wrong predictions are punished harshly

  • optimize a simple logistic regression with gradient descent and track D² during training

import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import d2_log_loss_score, log_loss
from sklearn.model_selection import train_test_split

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)

Quick import#

from sklearn.metrics import d2_log_loss_score

d2_log_loss_score expects predicted probabilities (predict_proba output), not hard class labels.

Prerequisites#

  • binary vs multiclass classification

  • probabilities and logs

  • log loss / cross-entropy

If you know R² from regression:

\[ R^2 = 1 - \frac{\text{SSE}}{\text{SST}} \]

then D² is the same “fraction explained” idea, but with log loss instead of squared error.

1) Intuition: baseline vs your model#

Log loss as “surprise”#

For a single binary sample with true label \(y \in \{0, 1\}\) and predicted probability \(p = \Pr(y=1)\):

\[ \ell_{\log}(y, p) = -\Bigl(y\log(p) + (1-y)\log(1-p)\Bigr) \]
  • If you say “99%” and you’re right, you get a tiny loss.

  • If you say “99%” and you’re wrong, you get a huge loss.

That’s why log loss is great when probabilities matter: it rewards calibration and penalizes overconfidence.
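
A quick numerical check of those two bullets (natural log, so the unit is nats):

```python
import numpy as np

p = 0.99  # "I'm 99% sure it's class 1"

loss_if_right = -np.log(p)      # true label y = 1
loss_if_wrong = -np.log(1 - p)  # true label y = 0

print(loss_if_right)  # ~0.01 nats: barely surprised
print(loss_if_wrong)  # ~4.61 nats: very surprised
```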

D² turns log loss into an “explained fraction”#

D² compares your model to a null model that ignores features and always predicts the class proportions in the evaluation set.

  • If your model is no better than the null model: D² = 0

  • If your model is perfect: D² = 1

  • If your model is worse than the null model: D² < 0 (can be very negative)

So you can read D² as:

“How much of the baseline log loss did we eliminate?”

2) Definition (math)#

Let there be \(n\) samples and \(K\) classes.

  • True labels as one-hot indicators: \(y_{i,k} \in \{0,1\}\) with \(\sum_k y_{i,k} = 1\)

  • Predicted probabilities: \(p_{i,k}\) with \(\sum_k p_{i,k} = 1\)

  • Optional sample weights: \(w_i \ge 0\)

Log loss (sum form)#

Using the natural logarithm (units = nats):

\[ \operatorname{LL}(p) = -\sum_{i=1}^n w_i \sum_{k=1}^K y_{i,k}\,\log(p_{i,k}) \]

(You can divide by \(\sum_i w_i\) to get a mean log loss; the D² ratio is unchanged.)
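
You can check that cancellation with scikit-learn's log_loss, whose normalize flag switches between sum and mean; the D² ratio comes out identical either way:

```python
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
p = rng.uniform(0.05, 0.95, size=200)  # some model's P(y=1)
p_null = np.full_like(p, y.mean())     # null model: the positive rate

# D² built from summed losses vs from mean losses
d2_sum = 1 - log_loss(y, p, normalize=False) / log_loss(y, p_null, normalize=False)
d2_mean = 1 - log_loss(y, p, normalize=True) / log_loss(y, p_null, normalize=True)
print(d2_sum, d2_mean)  # identical (up to float rounding)
```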

Null model#

The null model predicts constant probabilities equal to the weighted class proportions:

\[ \pi_k = \frac{\sum_{i=1}^n w_i\,y_{i,k}}{\sum_{i=1}^n w_i} \qquad\Rightarrow\qquad p^{\text{null}}_{i,k} = \pi_k \]
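
In NumPy the (unit-weight) class proportions are a bincount away; this sketch assumes integer labels 0..K−1:

```python
import numpy as np

y = np.array([0, 0, 1, 2, 2, 2])  # integer labels in {0, ..., K-1}
K = 3

pi = np.bincount(y, minlength=K) / y.size  # class proportions: 2/6, 1/6, 3/6
p_null = np.tile(pi, (y.size, 1))          # every row is the same prior

print(pi)
print(p_null.shape)  # (6, 3)
```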

D² log loss score#

\[ D^2_{\log} = 1 - \frac{\operatorname{LL}(p^{\text{model}})}{\operatorname{LL}(p^{\text{null}})} \]

Properties:

  • Best possible score is 1.

  • 0 means “no improvement over predicting class proportions”.

  • Can be negative and is unbounded below (log loss can blow up with confident wrong predictions).

Relationship to optimization#

For a fixed evaluation set, \(\operatorname{LL}(p^{\text{null}})\) is a constant.

So maximizing \(D^2_{\log}\) is equivalent to minimizing log loss:

\[\arg\max D^2_{\log} \;\equiv\; \arg\min \operatorname{LL}(p^{\text{model}})\]

3) NumPy from-scratch implementation#

We’ll implement two functions:

  • log_loss_numpy: the (weighted) cross-entropy / negative log-likelihood

  • d2_log_loss_score_numpy: the “fraction of log loss explained” relative to the null model

Notes:

  • We clip probabilities to avoid log(0).

  • For binary classification we accept probabilities shaped (n_samples,) or (n_samples, 1) (positive class) as well as (n_samples, 2).

import warnings


def sigmoid(z: np.ndarray) -> np.ndarray:
    # Numerically stable sigmoid
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z, dtype=float)

    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))

    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out


def softmax(z: np.ndarray, axis: int = 1) -> np.ndarray:
    # Numerically stable softmax
    z = np.asarray(z, dtype=float)
    z = z - np.max(z, axis=axis, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=axis, keepdims=True)
def _as_float_array(a):
    return np.asarray(a, dtype=float)


def _check_sample_weight(sample_weight, n_samples: int) -> np.ndarray:
    if sample_weight is None:
        return np.ones(n_samples, dtype=float)
    w = _as_float_array(sample_weight)
    if w.shape != (n_samples,):
        raise ValueError(f"sample_weight must have shape ({n_samples},), got {w.shape}.")
    if np.any(w < 0):
        raise ValueError("sample_weight must be non-negative.")
    return w


def _infer_classes(y_true, labels):
    y_true = np.asarray(y_true)

    if labels is None:
        classes = np.unique(y_true)
        if classes.size < 2:
            raise ValueError(
                f"y_true contains only one label ({classes[0]!r}). "
                "Please provide the true labels explicitly through the labels argument."
            )
        return classes

    classes = np.asarray(labels)
    if classes.size < 2:
        raise ValueError("labels needs to contain at least two classes.")
    return classes


def _label_to_index(classes: np.ndarray) -> dict:
    return {label: i for i, label in enumerate(classes.tolist())}


def log_loss_numpy(
    y_true,
    y_proba,
    *,
    normalize: bool = True,
    sample_weight=None,
    labels=None,
    eps: float | None = None,
) -> float:
    # NumPy log loss for 1D label arrays.
    # Parameters mirror sklearn.metrics.log_loss (subset).

    y_true = np.asarray(y_true)
    y_proba = _as_float_array(y_proba)

    n_samples = y_true.shape[0]
    if y_proba.shape[0] != n_samples:
        raise ValueError("y_true and y_proba must have the same number of samples.")

    classes = _infer_classes(y_true, labels)
    label_to_idx = _label_to_index(classes)

    try:
        y_idx = np.array([label_to_idx[y] for y in y_true.tolist()], dtype=int)
    except KeyError as e:
        raise ValueError(f"Found unknown label {e.args[0]!r} in y_true.") from None

    # Shape handling: allow binary probs as (n,) or (n, 1)
    if y_proba.ndim == 1:
        if classes.size != 2:
            raise ValueError("1D y_proba is only supported for binary classification.")
        y_proba = np.column_stack([1.0 - y_proba, y_proba])

    if y_proba.ndim != 2:
        raise ValueError("y_proba must be 1D or 2D.")

    if y_proba.shape[1] == 1:
        if classes.size != 2:
            raise ValueError("(n, 1) y_proba is only supported for binary classification.")
        p = y_proba[:, 0]
        y_proba = np.column_stack([1.0 - p, p])

    if y_proba.shape[1] != classes.size:
        raise ValueError(
            f"Number of probability columns ({y_proba.shape[1]}) does not match number of classes ({classes.size})."
        )

    if eps is None:
        eps = np.finfo(y_proba.dtype).eps

    row_sums = y_proba.sum(axis=1)
    if not np.allclose(row_sums, 1.0, rtol=np.sqrt(eps)):
        warnings.warn(
            "The y_proba rows do not sum to 1. Make sure to pass probabilities.",
            UserWarning,
        )

    y_proba = np.clip(y_proba, eps, 1.0 - eps)

    p_true = y_proba[np.arange(n_samples), y_idx]
    per_sample_loss = -np.log(p_true)

    w = _check_sample_weight(sample_weight, n_samples)
    loss_sum = float(np.sum(w * per_sample_loss))

    if normalize:
        return loss_sum / float(np.sum(w))
    return loss_sum


def d2_log_loss_score_numpy(y_true, y_proba, *, sample_weight=None, labels=None) -> float:
    # D² log loss score (fraction of log loss explained).

    y_true = np.asarray(y_true)
    y_proba = _as_float_array(y_proba)

    n_samples = y_true.shape[0]
    if y_proba.shape[0] != n_samples:
        raise ValueError("y_true and y_proba must have the same number of samples.")

    if n_samples < 2:
        warnings.warn(
            "D^2 score is not well-defined with less than two samples.",
            RuntimeWarning,
        )
        return float("nan")

    numerator = log_loss_numpy(
        y_true,
        y_proba,
        normalize=False,
        sample_weight=sample_weight,
        labels=labels,
    )

    classes = _infer_classes(y_true, labels)
    label_to_idx = _label_to_index(classes)
    y_idx = np.array([label_to_idx[y] for y in y_true.tolist()], dtype=int)

    w = _check_sample_weight(sample_weight, n_samples)
    counts = np.bincount(y_idx, weights=w, minlength=classes.size)
    priors = counts / float(np.sum(w))
    y_proba_null = np.tile(priors, (n_samples, 1))

    denominator = log_loss_numpy(
        y_true,
        y_proba_null,
        normalize=False,
        sample_weight=sample_weight,
        labels=labels,
    )

    return 1.0 - (numerator / denominator)

Sanity check vs scikit-learn#

We’ll compare our NumPy implementation to:

  • sklearn.metrics.log_loss

  • sklearn.metrics.d2_log_loss_score

On typical inputs (at least 2 classes present), the values should match up to tiny floating-point differences.

# Binary sanity check
n = 500
y_true_bin = rng.integers(0, 2, size=n)

# Random probabilities (not great, but valid)
p_pos = np.clip(rng.uniform(0.02, 0.98, size=n), 1e-15, 1 - 1e-15)

print('log_loss sklearn:', log_loss(y_true_bin, p_pos))
print('log_loss numpy :', log_loss_numpy(y_true_bin, p_pos))
print('d2 sklearn      :', d2_log_loss_score(y_true_bin, p_pos))
print('d2 numpy        :', d2_log_loss_score_numpy(y_true_bin, p_pos))

# Multiclass sanity check
n = 600
k = 3
y_true_mc = rng.integers(0, k, size=n)
logits = rng.normal(size=(n, k))
proba = softmax(logits, axis=1)

print()
print('multiclass log_loss sklearn:', log_loss(y_true_mc, proba, labels=[0, 1, 2]))
print('multiclass log_loss numpy :', log_loss_numpy(y_true_mc, proba, labels=[0, 1, 2]))
print('multiclass d2 sklearn      :', d2_log_loss_score(y_true_mc, proba, labels=[0, 1, 2]))
print('multiclass d2 numpy        :', d2_log_loss_score_numpy(y_true_mc, proba, labels=[0, 1, 2]))
log_loss sklearn: 0.9737837898808718
log_loss numpy : 0.9737837898808718
d2 sklearn      : -0.4054570626616003
d2 numpy        : -0.4054570626616003

multiclass log_loss sklearn: 1.3819229190361464
multiclass log_loss numpy : 1.3819229190361464
multiclass d2 sklearn      : -0.258491560577216
multiclass d2 numpy        : -0.258491560577216

4) Plots: how log loss and D² behave#

4.1) Per-sample log loss curves (binary)#

For \(y=1\) the loss is \(-\log(p)\). For \(y=0\) it is \(-\log(1-p)\).

These curves explain the “overconfident wrong is catastrophic” behavior.

p = np.linspace(1e-6, 1 - 1e-6, 600)
loss_y1 = -np.log(p)
loss_y0 = -np.log(1 - p)

fig = go.Figure()
fig.add_trace(go.Scatter(x=p, y=loss_y1, mode='lines', name='y=1:  -log(p)'))
fig.add_trace(go.Scatter(x=p, y=loss_y0, mode='lines', name='y=0:  -log(1-p)'))
fig.update_layout(
    title='Binary log loss for a single sample',
    xaxis_title='Predicted probability p = P(y=1)',
    yaxis_title='Per-sample log loss (nats)',
    width=900,
    height=450,
)
fig.show()

4.2) The null-model log loss is (empirical) entropy#

In the binary case, the null model predicts the positive rate \(\hat{p}\) for everyone.

Its mean log loss is the empirical Bernoulli entropy:

\[ H(\hat{p}) = -\bigl(\hat{p}\log\hat{p} + (1-\hat{p})\log(1-\hat{p})\bigr) \]
  • Maximum at \(\hat{p}=0.5\) (most uncertainty)

  • Small near 0 or 1 (almost-deterministic labels)

This matters because D² scales improvements by this baseline.

p_hat = np.linspace(1e-6, 1 - 1e-6, 600)
entropy = -(p_hat * np.log(p_hat) + (1 - p_hat) * np.log(1 - p_hat))

fig = go.Figure(go.Scatter(x=p_hat, y=entropy, mode='lines'))
fig.update_layout(
    title='Baseline (null-model) mean log loss = Bernoulli entropy H(p̂)',
    xaxis_title='Positive rate p̂',
    yaxis_title='H(p̂) (nats)',
    width=900,
    height=450,
)
fig.show()

4.3) D² for constant predictors#

Let’s generate a dataset with a fixed class balance and score a constant predictor that always outputs \(q\).

  • At \(q = \hat{p}\) (the empirical positive rate), D² = 0 (that’s the null model).

  • Moving away from \(\hat{p}\) makes log loss worse → D² becomes negative.

This is a nice way to see D² as a scaled version of log loss.

n = 800
p_true = 0.25
y = rng.binomial(1, p_true, size=n)

q = np.linspace(1e-4, 1 - 1e-4, 500)

# Score constant predictors
logloss_vals = np.array([log_loss_numpy(y, np.full_like(y, qi, dtype=float)) for qi in q])
d2_vals = np.array([d2_log_loss_score_numpy(y, np.full_like(y, qi, dtype=float)) for qi in q])

p_hat = y.mean()

fig = make_subplots(rows=1, cols=2, subplot_titles=('Mean log loss', 'D² log loss score'))

fig.add_trace(go.Scatter(x=q, y=logloss_vals, mode='lines', name='log loss'), row=1, col=1)
fig.add_vline(x=p_hat, line_dash='dash', line_color='gray', row=1, col=1)
fig.update_xaxes(title_text='Constant probability q', row=1, col=1)
fig.update_yaxes(title_text='Mean log loss', row=1, col=1)

fig.add_trace(go.Scatter(x=q, y=d2_vals, mode='lines', name='D²'), row=1, col=2)
fig.add_hline(y=0.0, line_dash='dash', line_color='gray', row=1, col=2)
fig.add_vline(x=p_hat, line_dash='dash', line_color='gray', row=1, col=2)
fig.update_xaxes(title_text='Constant probability q', row=1, col=2)
fig.update_yaxes(title_text='D²', row=1, col=2)

fig.update_layout(
    title=f'Constant predictors on a dataset with p̂={p_hat:.3f}',
    width=1000,
    height=430,
    showlegend=False,
)
fig.show()

5) Using log loss (and D²) to optimize logistic regression (from scratch)#

Because D² is just a rescaling of log loss (by a constant baseline), training usually minimizes log loss directly.

Binary logistic regression#

Model with parameters \(\beta\):

\[ \eta_i = x_i^\top\beta \qquad p_i = \sigma(\eta_i) = \frac{1}{1 + e^{-\eta_i}} \]

Average log loss:

\[ J(\beta) = -\frac{1}{n}\sum_{i=1}^n \Bigl(y_i\log(p_i) + (1-y_i)\log(1-p_i)\Bigr) \]

Gradient (classic result):

\[ \nabla J(\beta) = \frac{1}{n} X^\top (p - y) \]

We’ll run gradient descent and track both:

  • log loss (lower is better)

  • D² log loss score (higher is better)

# Synthetic 2D classification problem
X, y = make_classification(
    n_samples=1200,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    class_sep=1.4,
    flip_y=0.06,
    weights=[0.6, 0.4],
    random_state=7,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7, stratify=y
)

# Standardize (train statistics only)
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0) + 1e-12

X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma

# Add intercept term
X_train_i = np.column_stack([np.ones(X_train_s.shape[0]), X_train_s])
X_test_i = np.column_stack([np.ones(X_test_s.shape[0]), X_test_s])

# Quick look at the data
fig = px.scatter(
    x=X_train_s[:, 0],
    y=X_train_s[:, 1],
    color=y_train.astype(str),
    title='Training data (standardized)',
    labels={'x': 'x1 (standardized)', 'y': 'x2 (standardized)', 'color': 'class'},
)
fig.update_layout(width=750, height=450)
fig.show()
def fit_logreg_gd(
    X: np.ndarray,
    y: np.ndarray,
    *,
    lr: float = 0.2,
    n_epochs: int = 250,
    l2: float = 0.0,
):
    # Gradient descent for binary logistic regression.
    # X: (n, d) with intercept column already included.
    # y: (n,) in {0,1}

    n, d = X.shape
    w = np.zeros(d, dtype=float)

    history = {
        'epoch': [],
        'train_log_loss': [],
        'train_d2': [],
        'w': [],
    }

    for epoch in range(n_epochs):
        p = sigmoid(X @ w)

        # Mean log loss (for nicer learning-rate behavior)
        loss = log_loss_numpy(y, p)

        # Gradient of mean log loss + optional L2
        grad = (X.T @ (p - y)) / n
        grad[1:] += l2 * w[1:]  # don't regularize intercept

        w -= lr * grad

        history['epoch'].append(epoch)
        history['train_log_loss'].append(loss)
        history['train_d2'].append(d2_log_loss_score_numpy(y, p))
        history['w'].append(w.copy())

    return w, history


w_hat, hist = fit_logreg_gd(X_train_i, y_train, lr=0.25, n_epochs=240, l2=0.0)

p_train = sigmoid(X_train_i @ w_hat)
p_test = sigmoid(X_test_i @ w_hat)

print('Final (train) log loss:', log_loss_numpy(y_train, p_train))
print('Final (test)  log loss:', log_loss_numpy(y_test, p_test))
print('Final (train) D²      :', d2_log_loss_score_numpy(y_train, p_train))
print('Final (test)  D²      :', d2_log_loss_score_numpy(y_test, p_test))
Final (train) log loss: 0.12075603565990804
Final (test)  log loss: 0.07421418195747594
Final (train) D²      : 0.8206929808323062
Final (test)  D²      : 0.8897282386094872
# Plot training curves
fig = make_subplots(rows=1, cols=2, subplot_titles=('Log loss (train)', 'D² log loss score (train)'))

fig.add_trace(
    go.Scatter(x=hist['epoch'], y=hist['train_log_loss'], mode='lines', name='log loss'),
    row=1,
    col=1,
)
fig.update_xaxes(title_text='epoch', row=1, col=1)
fig.update_yaxes(title_text='mean log loss', row=1, col=1)

fig.add_trace(
    go.Scatter(x=hist['epoch'], y=hist['train_d2'], mode='lines', name='D²'),
    row=1,
    col=2,
)
fig.add_hline(y=0.0, line_dash='dash', line_color='gray', row=1, col=2)
fig.update_xaxes(title_text='epoch', row=1, col=2)
fig.update_yaxes(title_text='D²', row=1, col=2)

fig.update_layout(width=1000, height=420, showlegend=False)
fig.show()

Compare with scikit-learn#

We’ll fit sklearn.linear_model.LogisticRegression and score the predicted probabilities.

Because D² is a fixed rescaling of log loss on a given evaluation set, it ranks models exactly as log loss does there (higher D² ↔ lower log loss).

clf = LogisticRegression(penalty=None, solver='lbfgs', max_iter=2000, random_state=7)
clf.fit(X_train_s, y_train)

p_test_sklearn = clf.predict_proba(X_test_s)[:, 1]

print('sklearn log loss:', log_loss(y_test, p_test_sklearn))
print('numpy  log loss:', log_loss_numpy(y_test, p_test_sklearn))

print('sklearn D²      :', d2_log_loss_score(y_test, p_test_sklearn))
print('numpy  D²      :', d2_log_loss_score_numpy(y_test, p_test_sklearn))
sklearn log loss: 0.06472355069849435
numpy  log loss: 0.06472355069849435
sklearn D²      : 0.9038299722408762
numpy  D²      : 0.9038299722408762

6) Pros, cons, and when to use it#

Pros#

  • Interpretable baseline: 0 means “no better than predicting class proportions”.

  • Proper-probability sensitivity (via log loss): rewards well-calibrated probabilities, not just correct labels.

  • Works for multiclass naturally.

  • R²-like scale: best is 1, worse-than-baseline is negative.

Cons / caveats#

  • Unbounded below: very confident wrong predictions can drive the score to large negative values.

  • Dataset-dependent scaling: the baseline log loss depends on class balance; comparing D² across very different datasets can be misleading.

  • Requires probabilities: hard labels (predict) aren’t enough.

  • Not meaningful with too few samples: scikit-learn returns NaN for n_samples < 2.

  • Edge cases with missing classes: if an evaluation fold is missing a class, you may need to pass labels= consistently.
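
A sketch of that last edge case, using log_loss (d2_log_loss_score takes the same labels= argument): a small fold is missing class 2, so the full label set has to be declared for the 3-column probabilities to be accepted.

```python
import numpy as np
from sklearn.metrics import log_loss

y_fold = np.array([0, 1, 0, 1, 1, 0])  # class 2 absent from this fold

# Probabilities from a 3-class model (columns = classes 0, 1, 2)
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.6, 0.1],
    [0.8, 0.1, 0.1],
])

# Without labels=, the classes inferred from y_fold are {0, 1} and the
# 3-column proba is rejected; naming the full label set fixes it.
ll = log_loss(y_fold, proba, labels=[0, 1, 2])
print(ll)
```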

Good use cases#

  • evaluating probabilistic classifiers where calibration matters (risk scores, medical, churn probability, etc.)

  • reporting an “explained fraction” style score to audiences familiar with R²

  • model selection within the same dataset / split (it ranks models the same way as log loss)

7) Pitfalls & diagnostics#

  • Overconfidence hurts: if your D² is very negative, inspect examples with high loss (often miscalibrated or shifted data).

  • Always check the null score: D²=0 is not “bad” — it means “you didn’t beat class proportions”.

  • Class imbalance: on highly imbalanced datasets, the null model can already achieve low log loss; large D² improvements may be harder.

  • Label ordering (binary with (n_samples,) probabilities): scikit-learn assumes the probability is for the “positive” class (the second class in sorted order).

  • Use calibration tools when needed: reliability diagrams, isotonic regression, Platt scaling.
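
For the first bullet, the diagnostic is cheap: compute per-sample losses and sort. A sketch with synthetic binary labels and positive-class probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)

# Mostly-calibrated probabilities with some noise
p = np.clip(rng.normal(np.where(y == 1, 0.8, 0.2), 0.25), 1e-6, 1 - 1e-6)

# Per-sample binary log loss, largest first
per_sample = -(y * np.log(p) + (1 - y) * np.log(1 - p))
worst = np.argsort(per_sample)[::-1][:10]

print(per_sample[worst])  # the ten largest losses
print(y[worst])           # their true labels...
print(p[worst])           # ...and the (typically confidently wrong) probabilities
```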

Exercises#

  1. Create two classifiers with the same accuracy but different probability calibration. Compare log loss and D².

  2. For multiclass data, show how D² changes as you apply “temperature scaling” to logits.

  3. Add L2 regularization to the gradient descent logistic regression and observe how it affects log loss and D².

References#

  • scikit-learn API: sklearn.metrics.d2_log_loss_score

  • scikit-learn user guide: “D² score for classification” (see the docs page linked from the API)

  • C.M. Bishop (2006), Pattern Recognition and Machine Learning (cross-entropy / logistic regression)